Progress Memo 2

Final Project
Data Science 1 with R (STAT 301-1)

Author

Nikole Montero Cervantes

Published

November 29, 2023

Introduction

Data Overview

This dataset was acquired by scraping TripAdvisor (TA), a well-known travel website, for its restaurant information. It contains 1,083,397 restaurants across European countries and 42 variables, of which 25 are categorical and 17 are numerical. The raw datasets for Europe’s largest cities were then carefully selected and combined for further examination.

It is important to note that this dataset comprises only restaurants registered in the TripAdvisor database, so it might not encompass all the restaurants within a city.

Cleaning the data

In cleaning the data, various string manipulation functions and transformation techniques from the dplyr and stringr packages in R were employed. The dataset underwent a series of refinements to make it tidier and to facilitate downstream analyses. Key steps in the cleaning process include:

• Variable Renaming

• Creating and Modifying Variables

• Handling Categorical Data

• Text Processing

• List Manipulation

• Numeric Extraction

• Data Filtering and Handling Missing Values
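As a rough illustration of how the steps listed above might fit together, here is a minimal sketch using dplyr and stringr. The toy data, the column names (name, price_range, price_level, cuisines) and the "EUR 10-EUR 20" price format are all assumptions made for the example, not the actual structure of the scraped dataset.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the raw scrape; columns and values are hypothetical.
raw <- data.frame(
  name        = c("Flunch", "Don & Donna", "Chez Exemple"),
  price_range = c("EUR 10-EUR 20", "EUR 45-EUR 70", NA),
  price_level = c("cheap", "expensive", "mid-range"),
  cuisines    = c("French,  European", "Greek, Mediterranean", NA),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  # Variable renaming
  rename(restaurant_name = name) %>%
  mutate(
    # Numeric extraction: first and last number in the price string
    price_min = as.numeric(str_extract(price_range, "\\d+")),
    price_max = as.numeric(str_extract(price_range, "\\d+$")),
    # Creating a new variable from the extracted bounds
    avg_price = (price_min + price_max) / 2,
    # Handling categorical data: ordered levels for price
    price_level = factor(price_level,
                         levels = c("cheap", "mid-range", "expensive")),
    # Text processing: collapse extra whitespace and lower-case the tags
    cuisines = str_to_lower(str_squish(cuisines)),
    # List manipulation: one character vector of tags per restaurant
    cuisine_list = str_split(cuisines, ",\\s*")
  ) %>%
  # Filtering out rows whose price could not be parsed
  filter(!is.na(avg_price))

clean[, c("restaurant_name", "avg_price", "price_level")]
```

Each line of the mutate() call corresponds to one bullet above, so steps can be added or dropped independently.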

Starting the EDA

Univariate Analysis

To find patterns or unusual trends, I started by analyzing each variable in the dataset.

For Restaurants

Since there are too many observations to inspect individually, I decided to look at the 10 most common restaurant names.
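A minimal sketch of how the 10 most common names could be extracted with dplyr. The restaurants data frame below is toy data, and the column name restaurant_name is assumed to match the cleaned dataset.

```r
library(dplyr)

# Toy data; the real dataset has over a million rows.
restaurants <- data.frame(
  restaurant_name = c("Flunch", "Flunch", "Flunch",
                      "Leon de Bruxelles", "Leon de Bruxelles",
                      "Don & Donna"),
  stringsAsFactors = FALSE
)

top_names <- restaurants %>%
  count(restaurant_name, sort = TRUE, name = "n_restaurants") %>%  # most frequent first
  slice_head(n = 10)                                               # keep at most 10 names

top_names
```

The resulting table has one row per name, so it can be passed directly to a bar plot.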

The bar plot shows that several restaurant names appear many times. Looking back at the data, I realized that although those restaurants share a name, they are all in different cities. Taking Flunch as an example:

restaurant_name city
Flunch Franconville
Flunch Mers-les-Bains
Flunch Villebon-sur-Yvette
Flunch Le Quesnoy
Flunch Strasbourg
Flunch Bordeaux
Flunch Pau
Flunch Clermont-Ferrand
Flunch Tours
Flunch Besancon
Flunch Nantes
Flunch Poitiers
Flunch Avignon
Flunch Antibes
Flunch Roanne City
Flunch Epagny
Flunch Moulins
Flunch Macon
Flunch Montbeliard
Flunch Thionville
Flunch Boulogne-sur-Mer
Flunch Cholet
Flunch Amiens
Flunch Manosque
Flunch Vitrolles
Flunch Le Pontet
Flunch Saint-Jean-de-la-Ruelle
Flunch Herouville-Saint-Clair
Flunch Pertuis
Flunch Noyelles-Godault
Flunch Bonneuil-sur-Marne
Flunch Charleville-Mezieres
Flunch Chambery

Thus, I realized that those restaurants form a chain, which is why each name appears more than once. Something particularly interesting is that all the top restaurant chains are French. The chain with the highest number of restaurants is Leon de Bruxelles.
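One way to check the chain interpretation is to count how many distinct cities each name appears in. The sketch below uses toy data; the "City A" and "City B" values are placeholders, and treating any name found in more than one city as a chain is an assumption made for illustration.

```r
library(dplyr)

# Toy data; "City A" and "City B" are placeholders, not real locations.
restaurants <- data.frame(
  restaurant_name = c("Flunch", "Flunch", "Flunch",
                      "Leon de Bruxelles", "Leon de Bruxelles",
                      "Don & Donna"),
  city = c("Strasbourg", "Bordeaux", "Pau",
           "City A", "City B",
           "City A"),
  stringsAsFactors = FALSE
)

chains <- restaurants %>%
  group_by(restaurant_name) %>%
  summarise(
    n_locations = n(),               # rows sharing this name
    n_cities    = n_distinct(city),  # distinct cities for this name
    .groups = "drop"
  ) %>%
  filter(n_cities > 1) %>%           # assumed chain criterion
  arrange(desc(n_locations))

chains
```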

For Average Rating

This plot shows that European restaurants on TripAdvisor, across the 31 countries in the dataset, have high ratings, approximately between 4 and 4.8. This could suggest that the average quality offered in European restaurants is really good. This will be studied in more depth in the multivariate section.

For the Open Days Per Week

This plot shows that most restaurants are open seven days a week, followed by six and five days per week. That makes sense, since restaurants generally need to be open five days or more to make a profit.

However, some restaurants are open four days or fewer, which is atypical. The impact of so few opening days will be explored in the multivariate section.

For Country

This plot displays the number of restaurants per country. France has the highest number of restaurants in this dataset, which could potentially explain why the top 10 restaurant chains are French. Croatia and Finland are the countries with the fewest restaurants on TripAdvisor. France will be explored in more depth in a later section.

For Average Price

This histogram is right-skewed, with a mode around 20 to 30 euros. This could indicate that the majority of European restaurants that appear on TripAdvisor are affordable and generally do not exceed 50 euros. However, there are some exceptions, which appear as outliers in the boxplot, with prices ranging from 100 euros to 500 euros.
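To make the idea of boxplot outliers concrete, the sketch below lists the points a base R boxplot would flag, using boxplot.stats(), which applies the 1.5 × IQR rule to the hinges. The prices vector is toy data chosen to be right-skewed; it is not taken from the dataset.

```r
# Toy right-skewed prices in euros (hypothetical values).
prices <- c(12, 15, 18, 20, 22, 25, 25, 28, 30, 35, 40, 48, 120, 300, 500)

stats <- boxplot.stats(prices)  # default coef = 1.5

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values beyond the whiskers, drawn as individual points
```

With these toy values the hinges are 21 and 44, so the upper fence is 44 + 1.5 × 23 = 78.5, and only 120, 300 and 500 fall outside it.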

For Price Level

It is evident in this bar plot that most of the restaurants are mid-range, aligning with what was observed in the average price plot above. This reinforces the idea that the food offered in the majority of the restaurants in this dataset is affordable and potentially budget-friendly.

For Special Diets

This plot shows that most of the restaurants on TripAdvisor do not offer special diets on their menus. However, there is some presence of vegetarian options. There is also a possibility that, for some restaurants, this information was unknown and was therefore recorded as not offering special diets. Thus, any measured impact of special diets could be inaccurate and is not meaningful for this EDA.

For Cuisines

This plot shows that most of the restaurants, more than 10,000, offer European cuisine. This makes sense, since the restaurants I am exploring are located in different European cities.

A moderate number of restaurants, around 1,875, also operate as bars. Asian cuisine is offered by around 1,250 restaurants. African and North American cuisines have a lower presence on the menus of European restaurants. Fusion and South American cuisines are barely offered. Oceanian cuisine has the lowest presence among the restaurants in this dataset.
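Because one restaurant can list several cuisines, counts like these require splitting the cuisine field before tallying. A minimal sketch, assuming the cuisines are stored as comma-separated strings (the vector below is toy data):

```r
library(stringr)

# Toy cuisine strings; NA stands for a restaurant with no cuisine listed.
cuisines <- c("French, European", "Bar, European", "Asian, Fusion", NA, "European")

# Split each string into tags, flatten, and count each tag once per restaurant.
all_tags <- unlist(str_split(cuisines[!is.na(cuisines)], ",\\s*"))
cuisine_counts <- sort(table(all_tags), decreasing = TRUE)

cuisine_counts
```

Note that the counts sum to more than the number of restaurants, since each restaurant contributes one count per tag.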

Multivariate Analysis

Location of the restaurants

This latitude vs. longitude plot shows that most restaurants are located in France. This reinforces the univariate analysis, which indicated that France has the highest number of restaurants in the dataset.

Food Top and Bottom Ratings

Since there are a lot of observations, a plot of all of them would be hard to read. Thus, to make the analysis easier to follow, I decided to narrow the observations studied. Since most of the restaurants offer European cuisine, I decided to explore those restaurants to make the EDA more meaningful.

This filtered dataset will be used to explore the food, service and value ratings in this section.
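A minimal sketch of this filtering step and of selecting the top and bottom restaurants by food rating. The toy data, the column names and the use of str_detect() on a comma-separated cuisines column are assumptions; the memo uses the top and bottom 20, while the toy example uses 2.

```r
library(dplyr)
library(stringr)

# Toy data with hypothetical restaurant names A to E.
restaurants <- data.frame(
  restaurant_name = c("A", "B", "C", "D", "E"),
  cuisines        = c("European, French", "Asian", "European", "European, Bar", NA),
  food_rating     = c(5, 4.5, 2, 1.5, 3),
  stringsAsFactors = FALSE
)

# Keep restaurants that list European cuisine; rows with NA cuisines are dropped.
european <- restaurants %>%
  filter(str_detect(cuisines, "European"))

# Highest and lowest food ratings within the filtered data.
top_food    <- european %>% slice_max(food_rating, n = 2, with_ties = FALSE)
bottom_food <- european %>% slice_min(food_rating, n = 2, with_ties = FALSE)

top_food
bottom_food
```

Setting with_ties = FALSE caps the result at exactly n rows; with many restaurants tied at a rating of 5, which rows are kept then depends on their order in the data.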

This plot shows that the top 20 restaurants possess a food rating of 5 out of 5. This means that European cuisine is not only affordable, as the analysis in the previous section suggested, but also really tasty.

Another interesting finding is that these restaurants with the top food ratings are French, which links with the overall trend of the high performance and presence of restaurants in France.

This plot shows that most of the restaurants at the bottom have a low food rating of 2.0. One restaurant from the Flunch chain has a slightly higher food rating of 2.5. The lowest food rating is 1.5, from Don & Donna.

restaurant_name  country  food_rating  avg_price  price_level
Don & Donna      Greece   1.5          57.5       expensive
Flunch           France   2.0          15.0       cheap
Flunch           France   2.0          15.5       cheap
Flunch           France   2.5          15.5       cheap

It is interesting that Don & Donna, located in Greece, is still marked as expensive despite its low food rating. Meanwhile, the Flunch restaurants in France, with food ratings between 2.0 and 2.5 in the table above, are marked as cheap, with prices around 15 euros.

Service Top and Bottom Ratings

This plot shows that the restaurants that have the highest service rating, 5 out of 5, are French.

This plot shows the restaurants with the lowest service ratings. The most common rating among them is 2.5, followed by 2.0. The lowest service rating belongs to Don & Donna, which also has the lowest food rating, as seen previously.

The Flunch chain appears again, meaning that its restaurants have not only a low food rating but also a low service rating.

Value Top and Bottom Ratings

This plot shows that the restaurants that have the highest value rating, 5 out of 5, are French.

This plot shows that the lowest value ratings are mostly 2. Something particular about this plot is that Don & Donna appears again at the bottom. Looking at all the ratings of Don & Donna explored so far:

restaurant_name  food_rating  service_rating  value_rating  avg_price  price_level
Don & Donna      1.5          1.5             1             57.5       expensive

Thus, it is possible to infer that Don & Donna is the worst restaurant in this dataset based on its food, service and value ratings. Still, it is interesting that even though its ratings are bad, its prices remain expensive, at an average of 57.5 euros. This leads me to think that maybe Greek restaurants are usually expensive regardless of their rating.
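One way this hunch about Greek restaurants could be checked is to compare median prices for low-rated and high-rated restaurants within each country. The sketch below uses toy data; apart from Don & Donna's 1.5 rating and 57.5 euro price, every value is invented for illustration, and the cut-off of 3 for a "low" rating is an assumption.

```r
library(dplyr)

# Toy data; only the first row reflects a value quoted in this memo.
restaurants <- data.frame(
  country     = c("Greece", "Greece", "Greece", "France", "France", "France"),
  food_rating = c(1.5, 4, 4.5, 2, 4.5, 5),
  avg_price   = c(57.5, 30, 25, 15, 40, 60),
  stringsAsFactors = FALSE
)

price_by_rating <- restaurants %>%
  filter(!is.na(avg_price), !is.na(food_rating)) %>%
  mutate(rating_band = if_else(food_rating < 3, "low", "high")) %>%  # assumed cut-off
  group_by(country, rating_band) %>%
  summarise(
    n            = n(),                # restaurants in this group
    median_price = median(avg_price),  # typical price for the group
    .groups = "drop"
  )

price_by_rating
```

If Greek restaurants were expensive regardless of rating, the median price for the low and high bands in Greece would be similar and both above those of other countries; the group sizes in column n indicate how much weight each comparison can bear.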

Exploring France a bit deeper